import besca as bc
import pkg_resources
Specify the datasets that are already annotated and should be used for training. If you use only one dataset, pass it as a single-element list.
# the path to the datasets
train_dataset_paths = [pkg_resources.resource_filename('besca', 'datasets/data')]
# the names of the h5ad files
train_datasets = ['Martin2019_processed.h5ad']
The dataset of interest that should be annotated.
test_dataset = 'Smillie2019_processed.h5ad'
test_dataset_path = pkg_resources.resource_filename('besca', 'datasets/data')
Give your analysis a name.
analysis_name = 'auto_annot_Smillie2019_with_Martin2019_Type'
Specify the column name of the cell type annotation you want to train on.
celltype = 'Type'
Choose a method:
method = 'logistic_regression'
Specify the merge method if using multiple training datasets. It must be either 'scanorama' or 'naive'.
merge = 'scanorama'
Decide whether to use the raw data or only the highly variable genes. Using raw data increases computation time and does not necessarily improve predictions.
use_raw = False
You can choose to only consider a subset of genes from a signature set.
genes_to_use = 'all'
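Before reading any data, the settings above can be sanity-checked. The following is a minimal sketch: `check_config` is a hypothetical helper and not part of besca, and the rule that there is one path per training dataset is an assumption based on the example above.

```python
# Hypothetical helper, not part of besca: sanity-check the settings above.
ALLOWED_MERGE = {"scanorama", "naive"}


def check_config(train_paths, train_names, merge):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not isinstance(train_paths, list) or not isinstance(train_names, list):
        # Even a single training dataset should be wrapped in a list.
        problems.append("training paths and names must both be lists")
    elif len(train_paths) != len(train_names):
        # Assumption: paths and file names are matched up by position.
        problems.append("expected one path per training dataset")
    if merge not in ALLOWED_MERGE:
        problems.append(f"merge must be one of {sorted(ALLOWED_MERGE)}")
    return problems


print(check_config(["/data"], ["Martin2019_processed.h5ad"], "scanorama"))
print(check_config("/data", "Martin2019_processed.h5ad", "combat"))
```

The first call reports no problems. The second reports two: the inputs are plain strings rather than lists, and the merge method is not recognised.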
adata_trains, adata_pred, adata_orig = bc.tl.auto_annot.read_data(
    train_paths=train_dataset_paths, train_datasets=train_datasets,
    test_path=test_dataset_path, test_dataset=test_dataset, use_raw=use_raw)
This function merges the training datasets, removes unwanted genes, and, if scanorama is used, corrects for batch effects between datasets.
adata_train, adata_pred = bc.tl.auto_annot.merge_data(adata_trains, adata_pred, genes_to_use=genes_to_use, merge=merge)
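To illustrate what a naive merge could look like, here is a minimal sketch that restricts several cells-by-genes matrices to their shared genes and stacks them. `naive_merge` is a hypothetical function written for illustration; besca's actual `merge_data` may work differently, and no batch correction is performed here.

```python
import numpy as np


def naive_merge(matrices, gene_lists):
    """Stack cells-by-genes matrices after restricting them to shared genes."""
    # Genes present in every dataset, in a fixed (sorted) order.
    shared = sorted(set(gene_lists[0]).intersection(*gene_lists[1:]))
    blocks = []
    for X, genes in zip(matrices, gene_lists):
        index = {g: i for i, g in enumerate(genes)}
        cols = [index[g] for g in shared]  # reorder columns to match `shared`
        blocks.append(np.asarray(X)[:, cols])
    return np.vstack(blocks), shared


# Toy data: one cell per dataset, with differing gene sets and gene order.
merged, shared_genes = naive_merge(
    [[[1, 2, 3]], [[4, 5]]],
    [["A", "B", "C"], ["C", "A"]],
)
print(shared_genes)
print(merged)
```

Only genes "A" and "C" survive, and the columns of the second dataset are reordered to match the first.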
Train the classifier. The returned scaler is fitted on the training dataset and scales it to zero mean and unit variance.
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype)
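The idea behind fitting the scaler on the training data can be sketched with plain NumPy. This is an illustration on random toy data, not besca's implementation; how besca configures its scaler internally (for example for sparse input) is not confirmed here.

```python
import numpy as np

# Toy expression matrices (cells x genes); the values are random placeholders.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# "Fit": per-gene statistics are computed on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0  # avoid dividing by zero for constant genes

# "Transform": the same statistics are applied to both datasets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The test data is scaled with the training statistics, so its columns will generally not have exactly zero mean or unit variance. This keeps the features on the scale the classifier was trained on.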
Use the fitted model to predict cell types in adata_pred. The prediction is added in a new column called 'auto_annot'. The original data (adata_orig) is needed because the returned object reverts to its original state (all genes, no additional corrections). For SVM, set the threshold to 0 or leave it out. For logistic regression, the threshold can be set.
adata_predicted = bc.tl.auto_annot.adata_predict(classifier=classifier, scaler=scaler, adata_pred=adata_pred, adata_orig=adata_orig, threshold=0.1)
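A probability threshold of this kind can be sketched as follows. The sketch assumes that cells whose highest class probability falls below the threshold receive a placeholder label; `label_with_threshold` and the `"unknown"` label are assumptions for illustration, not confirmed besca behaviour.

```python
import numpy as np


def label_with_threshold(proba, classes, threshold, fallback="unknown"):
    """Pick the most probable class per cell, or `fallback` if it is too uncertain."""
    proba = np.asarray(proba)
    best = proba.argmax(axis=1)                        # index of the top class
    labels = np.asarray(classes, dtype=object)[best]   # fancy indexing returns a copy
    labels[proba.max(axis=1) < threshold] = fallback   # mask low-confidence cells
    return labels.tolist()


# Toy class probabilities for two cells and two cell types.
proba = [[0.90, 0.10], [0.55, 0.45]]
classes = ["T cell", "B cell"]
print(label_with_threshold(proba, classes, threshold=0.6))
print(label_with_threshold(proba, classes, threshold=0.0))
```

With a threshold of 0.6 the second cell is left unlabelled. With a threshold of 0 every cell receives its most probable class, which matches the recommendation for classifiers that do not output probabilities.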
Write out metrics to a report file, and create confusion matrices and comparative UMAP plots.
%matplotlib inline
bc.tl.auto_annot.report(adata_predicted, celltype, method, analysis_name, train_datasets, test_dataset, False, merge, use_raw, genes_to_use, clustering = 'leiden')
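For intuition about what a confusion matrix summarises, here is a minimal standalone sketch using only the standard library. The labels are toy values and `confusion_counts` is a hypothetical helper; the metrics and file layout of the besca report itself are not reproduced here.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Toy annotations for four cells.
true = ["T cell", "T cell", "B cell", "B cell"]
pred = ["T cell", "B cell", "B cell", "B cell"]

counts = confusion_counts(true, pred)
# Accuracy is the share of cells on the diagonal (true label == predicted label).
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(true)

for (t, p), n in sorted(counts.items()):
    print(f"true={t!r} predicted={p!r}: {n}")
print("accuracy:", accuracy)
```

Off-diagonal pairs, such as a T cell predicted as a B cell, show which cell types the classifier confuses.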
import scanpy as sc
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot'])
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot'], legend_loc='on data', legend_fontsize=6)
adata_train